pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, patchwork, ggiraph, ggrepel)
Take Home_Ex03
1. Background
FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.
With reference to Mini-Challenge 3 of VAST Challenge 2023 and by using appropriate static and interactive statistical graphics methods, we will be helping FishEye to better understand fishing business anomalies.
2. Data Source
The data is taken from the Mini-Challenge 3 of VAST Challenge 2023.
3. Data Preparation
3.1 Install and launching R packages
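The code chunk at the top of this document uses p_load() of the pacman package to check whether the required packages are installed on the computer, install any that are missing, and load them into R. The R packages loaded are jsonlite, tidygraph, ggraph, visNetwork, graphlayouts, ggforce, skimr, tidytext, tidyverse, patchwork, ggiraph and ggrepel.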
3.2 Loading the Data
fromJSON() of the jsonlite package is used to import MC3.json into the R environment.
mc3_data <- fromJSON("data/MC3.json")
The output is called mc3_data. It is a large list R object.
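To see why the result is a list, the sketch below parses a tiny hand-written JSON string with the same top-level nodes and links keys. The JSON content is made up for illustration and is not taken from MC3.json.

```r
library(jsonlite)

# Toy node-link JSON (hypothetical content, not from MC3.json)
toy_json <- '{
  "nodes": [{"id": "Company A", "country": "ZH"},
            {"id": "Jane Doe",  "country": "ZH"}],
  "links": [{"source": "Company A", "target": "Jane Doe",
             "type": "Beneficial Owner"}]
}'

toy_data <- fromJSON(toy_json)

class(toy_data)        # a plain R list
names(toy_data)        # the two top-level keys
class(toy_data$links)  # arrays of JSON objects are simplified to data frames
```

Each element can then be pulled out with $ and converted with as_tibble(), which is the pattern used in the next two subsections.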
3.3 Extracting edges
The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.
mc3_edges <- as_tibble(mc3_data$links) %>%
distinct() %>%
mutate(source = as.character(source),
target = as.character(target),
type = as.character(type)) %>%
group_by(source, target, type) %>%
summarise(weights = n()) %>%
filter(source!=target) %>%
ungroup()
3.4 Extracting nodes
The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
mutate(country = as.character(country),
id = as.character(id),
product_services = as.character(product_services),
revenue_omu = as.numeric(as.character(revenue_omu)),
type = as.character(type)) %>%
select(id, country, type, revenue_omu, product_services) # select() is used to set the column order
4. Data Exploration and Data Wrangling
4.1 Exploring the edges data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.
skim(mc3_edges)
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there are no missing values in any field.
In the code chunk below, datatable() of the DT package is used to display the mc3_edges tibble data frame as an interactive table in the HTML document.
DT::datatable(mc3_edges)
The edge table shows the relationship between each source and target. Here the source is the company, and its relationship with the target is given by the type column. There are two kinds of relationship: Beneficial Owner and Company Contacts.
4.1.1 Plotting the variables in edge dataframe
Below is the code chunk using ggplot to plot the following:
Distribution of the types of relationship that exist between the source and target, with their corresponding frequencies.
Number of companies that a beneficial owner owns.
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Plot distribution of type
hist_type <- ggplot(data = mc3_edges,
aes(x = type)) +
geom_bar() +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) +
labs(title = "Distribution of Relationship Types", x = "Type", y = "Count") +
theme_bw() +
theme(plot.title = element_text(face = "bold"))
#| echo: false
#| fig-width: 3
#| fig-height: 3
#Filter the type == "Beneficial Owner"
mc3_edges_owner <- mc3_edges %>%
filter(type == "Beneficial Owner") %>%
group_by(target, type) %>%
summarise(no_of_companies = n()) %>%
ungroup()
# Create a ggplot histogram to plot the no of companies a beneficial owner owns
gg_hist_own <- ggplot(mc3_edges_owner, aes(x = no_of_companies)) +
geom_histogram(fill = "steelblue") +
labs(title = "No of companies beneficial owners own", x = "No of companies", y = "Count") +
theme_bw() +
theme(plot.title = element_text(face = "bold")) +
scale_x_continuous(breaks = seq(min(mc3_edges_owner$no_of_companies), max(mc3_edges_owner$no_of_companies), by = 1))
# Calculate frequency counts for each bin
freq_counts <- table(mc3_edges_owner$no_of_companies)
# Create a data frame for labels
label_data <- data.frame(x = as.numeric(names(freq_counts)), y = as.numeric(freq_counts))
# Add frequency labels to the plot
gg_hist_own <- gg_hist_own +
geom_text(
data = label_data,
aes(x = x, y = y, label = y),
vjust = -0.5,
size = 3
)
#| echo: false
#| fig-width: 3
#| fig-height: 3
#Combining the two plots using patchwork
combined_plot <- hist_type / gg_hist_own
combined_plot
As seen from the above plot, there are 16,792 Beneficial Owner edges and 7,244 Company Contacts edges.
We can also see that the majority of owners own 1 company. In fact, less than 0.5% of the beneficial owners own more than 3 companies. This may be suspicious and will be investigated further. We could also look at the size of each company in terms of its revenue and the number of owners it has (explored later on).
One working hypothesis is that owners who hold many small companies are unlikely to be holding listed companies, since listed companies are usually more transparent and have more owners.
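The share of owners above the 3-company threshold can be computed directly from the per-owner counts. The sketch below uses a small made-up vector in place of mc3_edges_owner$no_of_companies, so the resulting percentage is illustrative only.

```r
# Made-up counts of companies per beneficial owner (illustration only)
no_of_companies <- c(1, 1, 1, 2, 1, 5, 1, 3, 1, 4)

# Proportion of owners holding more than 3 companies:
# the mean of a logical vector is the share of TRUE values
share_over_3 <- mean(no_of_companies > 3)
share_over_3 * 100   # as a percentage

# Frequency table of owners by number of companies held
table(no_of_companies)
```

Applying the same expression to the real no_of_companies column would give the figure quoted in the text.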
4.1.2 Creating new edge dataframe
Below is the code chunk to create a new edge dataframe called mc3_edges_with_no_of_companies, which has the no_of_companies column added in.
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Join the no_of_companies column from mc3_edges_owner into mc3_edges
mc3_edges_with_no_of_companies <- mc3_edges %>%
left_join(mc3_edges_owner %>% select(target, no_of_companies),
by = c("target" = "target")) %>%
mutate(no_of_companies = ifelse(is.na(no_of_companies), 0, no_of_companies))
# View the updated mc3_edges
mc3_edges_with_no_of_companies
# A tibble: 24,036 × 5
source target type weights no_of_companies
<chr> <chr> <chr> <int> <dbl>
1 1 AS Marine sanctuary Christina Taylor Compa… 1 1
2 1 AS Marine sanctuary Debbie Sanders Benef… 1 1
3 1 Ltd. Liability Co Cargo Angela Smith Benef… 1 1
4 1 S.A. de C.V. Catherine Cox Compa… 1 0
5 1 and Sagl Forwading Angela Mendoza Compa… 1 0
6 1 and Sagl Forwading Christopher Watson Benef… 1 1
7 2 Limited Liability Company Amanda Mcdonald Benef… 1 1
8 2 Limited Liability Company Megan Padilla Compa… 1 0
9 2 Limited Liability Company Monica Martinez Compa… 1 0
10 2 Limited Liability Company Teresa Collins Benef… 1 1
# ℹ 24,026 more rows
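The join pattern above can be tried on a small example. The sketch below uses made-up edge and owner tables (the names toy_edges and toy_owner are hypothetical) to show how left_join() leaves NA for targets that are not beneficial owners, and how those NA values are then set to 0.

```r
library(dplyr)

toy_edges <- tibble(
  source = c("Co A", "Co A", "Co B"),
  target = c("Ann", "Bob", "Ann"),
  type   = c("Beneficial Owner", "Company Contacts", "Beneficial Owner")
)

# One row per beneficial owner with the number of companies owned
toy_owner <- toy_edges %>%
  filter(type == "Beneficial Owner") %>%
  count(target, name = "no_of_companies")

# Bob never appears as a beneficial owner, so the join gives him NA,
# which is then replaced with 0
toy_joined <- toy_edges %>%
  left_join(toy_owner, by = "target") %>%
  mutate(no_of_companies = ifelse(is.na(no_of_companies), 0, no_of_companies))

toy_joined
```

Because the owner table has one row per target, the join does not add rows to the edge table.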
4.2 Exploring the nodes data frame
In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame.
skim(mc3_nodes)
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
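There are a large number of missing values in the revenue_omu column: 21,515 of the 27,622 rows are missing, a complete rate of 0.22. The code chunk below makes sure the column is stored as numeric.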
# Convert the "revenue_omu" column to numeric in mc3_nodes
mc3_nodes <- mc3_nodes %>%
mutate(revenue_omu = as.numeric(revenue_omu))
In the code chunk below, datatable() of the DT package is used to display the mc3_nodes tibble data frame as an interactive table in the HTML document.
DT::datatable(mc3_nodes)
Observing the nodes datatable above, we notice that some of the node ids are not unique: the same id may appear with more than 1 country, more than 1 product_services entry and/or more than 1 revenue value.
4.2.1 Handling of missing and/or unknown values
Notice that the product_services column contains NA or character(0) values, which are meaningless, so we replace them with “unknown”. NA values in the revenue_omu column are replaced with the value “0”.
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Note: replacing with "0" (a string) coerces revenue_omu to character;
# it is converted back to numeric before the revenue analysis
mc3_nodes <- mc3_nodes %>%
  mutate(product_services = ifelse(product_services == "character(0)", "unknown", product_services),
         revenue_omu = ifelse(revenue_omu == "" | is.na(revenue_omu), "0", revenue_omu))
4.2.2 Checking for duplicate nodes and removing them
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Calculate the number of duplicates in mc3_nodes
num_duplicates_nodes <- sum(duplicated(mc3_nodes))
# Display the number of duplicates
num_duplicates_nodes
[1] 2595
# Remove duplicates from mc3_nodes
mc3_nodes_unique <- distinct(mc3_nodes)
There are a total of 2,595 duplicated rows in the nodes data. These duplicates are removed and a new nodes data frame, mc3_nodes_unique, is created.
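As a quick check of what is being counted, the sketch below applies the same two steps to a made-up data frame. duplicated() flags every repeat of an earlier row, so the count equals the number of rows dropped by unique(), the base R function standing in here for distinct().

```r
# Made-up node table (illustration only)
toy_nodes <- data.frame(
  id      = c("Co A", "Co A", "Co B", "Co A"),
  country = c("ZH",   "ZH",   "ZH",   "Oceanus"),
  stringsAsFactors = FALSE
)

# Rows that are exact copies of an earlier row
num_dup <- sum(duplicated(toy_nodes))
num_dup

# Dropping them leaves one copy of each distinct row
toy_unique <- unique(toy_nodes)
nrow(toy_nodes) - nrow(toy_unique) == num_dup

# The same id can still appear more than once if other columns differ
table(toy_unique$id)
```

This is why removing exact duplicates does not by itself make the id column unique.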
4.2.3 Distribution of the type of nodes
#| echo: false
#| fig-width: 3
#| fig-height: 3
hist_type_node <- ggplot(data = mc3_nodes_unique,
aes(x = type)) +
geom_bar()+
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) +
labs(title = "Distribution of Node Type", x = "Type", y = "Count") +
theme_bw() +
theme(plot.title = element_text(face = "bold"))
#hist_type_node
4.2.4 Distribution of the product_services
In this section, we will perform text sensing using appropriate functions of tidytext package.
To begin, we will employ the tokenisation process. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for the processes like parsing and text mining.
In the code chunk below, unnest_tokens() of tidytext is used to split the text in the product_services field into words.
token_nodes <- mc3_nodes_unique %>%
unnest_tokens(word,
product_services)
The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
By default, punctuation is stripped. (Use the strip_punct = FALSE argument to turn off this behavior).
By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior).
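Both defaults can be seen on a small example. The sketch below tokenises two made-up product descriptions (the data is hypothetical), first with the defaults and then with to_lower = FALSE.

```r
library(dplyr)
library(tidytext)

# Made-up product descriptions (illustration only)
toy_text <- tibble(
  id = c("Co A", "Co B"),
  product_services = c("Fresh Fish, Shrimp", "Canned TUNA and crab")
)

# Default behaviour: one row per word, lowercase, punctuation removed
toy_tokens <- toy_text %>%
  unnest_tokens(word, product_services)
toy_tokens$word

# Keep the original case instead
toy_tokens_case <- toy_text %>%
  unnest_tokens(word, product_services, to_lower = FALSE)
toy_tokens_case$word
```

The other columns (id here) are repeated for every token, which is why token_nodes has many more rows than mc3_nodes_unique.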
Now we can list the most frequent words extracted by using the code chunk below.
token_nodes %>%
count(word, sort = TRUE) %>%
top_n(5) %>%
mutate(word = reorder(word, n))
# A tibble: 5 × 2
word n
<fct> <int>
1 unknown 21009
2 and 6389
3 products 1860
4 of 881
5 as 752
token_nodes
# A tibble: 80,858 × 5
id country type revenue_omu word
<chr> <chr> <chr> <chr> <chr>
1 Jones LLC ZH Company 310612303.447 automobi…
2 Coleman, Hall and Lopez ZH Company 162734683.9969 passenger
3 Coleman, Hall and Lopez ZH Company 162734683.9969 cars
4 Coleman, Hall and Lopez ZH Company 162734683.9969 trucks
5 Coleman, Hall and Lopez ZH Company 162734683.9969 vans
6 Coleman, Hall and Lopez ZH Company 162734683.9969 and
7 Coleman, Hall and Lopez ZH Company 162734683.9969 buses
8 Aqua Advancements Sashimi SE Express Oceanus Company 115004666.6728 holding
9 Aqua Advancements Sashimi SE Express Oceanus Company 115004666.6728 firm
10 Aqua Advancements Sashimi SE Express Oceanus Company 115004666.6728 whose
# ℹ 80,848 more rows
The table above reveals that the most frequent words include some that are not useful for analysis, for instance “and” and “of”. In the world of text mining we call those words stop words. The tidytext package provides a data set called stop_words that can help us remove them.
stopwords_removed <- token_nodes %>%
anti_join(stop_words)
There are two processes:
Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis.
Then anti_join() of the dplyr package is used to remove all stop words from the analysis.
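The filtering step can be illustrated without the full stop-word list. In the sketch below, toy_stop is a hypothetical two-word table standing in for tidytext's stop_words; anti_join() keeps only the rows of the first table that have no match in the second.

```r
library(dplyr)

# Made-up tokens (illustration only)
toy_tokens <- tibble(word = c("fish", "and", "seafood", "of", "products"))

# Hypothetical two-word stop list standing in for tidytext::stop_words
toy_stop <- tibble(word = c("and", "of"))

# Keep only rows of toy_tokens whose word has no match in toy_stop
toy_clean <- toy_tokens %>%
  anti_join(toy_stop, by = "word")
toy_clean$word
```

Naming the key with by = "word" also avoids the message that dplyr prints when it has to guess the join column.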
stopwords_removed %>%
filter(!word %in% c("unknown", "services", "related","including", "offers","range")) %>% #filter away meaningless words
count(word, sort = TRUE) %>%
top_n(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # x is the word and y is its count; coord_flip() only swaps where they are drawn
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")
The below code chunk will better help us categorise our product_services for analysis into fishing related, non-fishing related and unknown.
#Create a list of fishing related words
include_words <- c("fish", "fishing", "seafood", "seafoods","prawns","prawn", "salmon","tuna","shrimp","shrimps","crab","squid","oyster","clam","mollusks","crustaceans","roe","fillet","haddock","octopus","herring","lobsters","seabass","cephalopods","cod","shellfish","shark","chum")
#Use the grepl() function to flag rows whose product_services text contains any word in include_words as a whole word. Store the result in a new column called category
mc3_nodes_unique$category <- ifelse(grepl(paste0("\\b", paste(include_words, collapse = "\\b|\\b"), "\\b"),
                                          tolower(mc3_nodes_unique$product_services)),
                                    "Fishing-related",
                                    #missing values were recoded as lower-case "unknown" earlier, so compare in lower case
                                    ifelse(tolower(mc3_nodes_unique$product_services) == "unknown",
                                           "Unknown",
                                           "Non-fishing related"))
#| echo: false
#| fig-width: 3
#| fig-height: 3
library(dplyr)
library(ggplot2)
library(ggrepel)
# Define the colors for each category
category_colors <- c("Fishing-related" = "#B4D4E7", "Non-fishing related" = "#B4E7BD", "Unknown" = "#D3D3D3")
# Set the category as a factor with desired order
category_freq <- mc3_nodes_unique %>%
mutate(category = factor(category, levels = c("Fishing-related", "Non-fishing related", "Unknown"))) %>%
count(category) %>%
mutate(percentage = prop.table(n) * 100)
# Create a pie chart with labels
ggplot_cat <- ggplot(category_freq, aes(x = "", y = n, fill = category)) +
geom_bar(width = 1, stat = "identity", color = "black") +
coord_polar(theta = "y") +
xlab("") +
ylab("") +
labs(title = "Distribution of Category") +
theme_void() +
theme(legend.position = "right",
plot.title = element_text(hjust = 0.5, face = "bold")) +
geom_label_repel(aes(label = paste0(category, "\nCount: ", n, "\n", round(percentage, 1), "%")),
box.padding = 0.5,
point.padding = 0.1,
segment.color = "black",
show.legend = FALSE,
label.color = "black") +
scale_fill_manual(values = category_colors)
Looking at the above pie chart, we can see that companies offering fishing-related products make up only a small percentage, around 4%; the rest are either non-fishing related or unknown.
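The keyword matching behind these categories can be checked on a few made-up strings. The sketch below uses a shortened keyword list and hypothetical product descriptions. It compares against the lower-cased placeholder "unknown", which is an assumption about how missing values were recoded earlier.

```r
# Shortened keyword list for illustration
toy_words <- c("fish", "tuna", "crab")

# Wrap each keyword in \\b so that only whole words match
pattern <- paste0("\\b", paste(toy_words, collapse = "\\b|\\b"), "\\b")
pattern

# Made-up product descriptions (illustration only)
toy_services <- c("Fresh fish and seafood",
                  "Fisheries consulting",
                  "Canned Tuna",
                  "unknown")

# tolower() makes the match case-insensitive
is_fishing <- grepl(pattern, tolower(toy_services))
is_fishing

toy_category <- ifelse(is_fishing, "Fishing-related",
                       ifelse(tolower(toy_services) == "unknown",
                              "Unknown", "Non-fishing related"))
toy_category
```

Note that "Fisheries consulting" is not flagged, because the word boundaries require an exact whole-word match. Related word forms therefore have to be listed explicitly, as the full include_words list does for pairs such as "shrimp" and "shrimps".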
Finding the median revenue for each category
#| echo: false
#| fig-width: 3
#| fig-height: 3
library(dplyr)
library(ggplot2)
#Convert revenue_omu to numeric
mc3_nodes_unique <- mc3_nodes_unique %>%
mutate(revenue_omu = as.numeric(revenue_omu))
# Define the colors for each category
category_colors <- c("Fishing-related" = "#B4D4E7", "Non-fishing related" = "#B4E7BD", "Unknown" = "#D3D3D3")
# Calculate the median revenue_omu for each category
median_revenue <- mc3_nodes_unique %>%
group_by(category) %>%
filter(category != "Non-fishing related" | (category == "Non-fishing related" & revenue_omu != 0 & !is.na(revenue_omu))) %>%
summarize(median_revenue_omu = median(revenue_omu, na.rm = TRUE))
# Plot the bar chart
ggplot_rev <- ggplot(median_revenue, aes(x = category, y = median_revenue_omu, fill = category)) +
geom_col() +
scale_fill_manual(values = category_colors) +
xlab("Category") +
ylab("Median Revenue (OMU)") +
labs(title = "Median Revenue by Category") +
theme_bw() +
theme(plot.title = element_text(face = "bold")) +
geom_text(aes(label = round(median_revenue_omu, 2)), vjust = -0.5)
#| echo: false
#| fig-width: 3
#| fig-height: 3
combined_plot2 <- ggplot_cat / ggplot_rev
combined_plot2
5. Network Visualisation and Analysis
5.1 Building network model with tidygraph for Beneficial Owners
Based on our edge dataframe analysis earlier on, we found out that less than 0.5% of the beneficial owners own more than 3 companies, which calls for suspicion, thus we will further investigate.
Preparing edge data table
#filter those beneficial owners that have more than 3 companies
filtered_mc3_edges_owner <- mc3_edges_with_no_of_companies %>%
filter(no_of_companies > 3, type == "Beneficial Owner")
filtered_mc3_edges_owner
# A tibble: 313 × 5
source target type weights no_of_companies
<chr> <chr> <chr> <int> <dbl>
1 Acevedo, Dickson and Gonzalez Richard Smith Bene… 1 6
2 Adams Group John Smith Bene… 1 9
3 Adams-Pope Michelle Rodr… Bene… 1 4
4 Adriatic Catch S.A. de C.V. David Jones Bene… 1 6
5 Albertine Rift NV Family Michael Taylor Bene… 1 4
6 Alexander PLC David Jones Bene… 1 6
7 Alvarez Ltd Michael Carter Bene… 1 5
8 Alvarez, Young and Ramos Michael Miller Bene… 1 5
9 Ancla del Este Ltd. Liability Co Aaron Jones Bene… 1 4
10 Ancla del Este Sp Fish John Jones Bene… 1 4
# ℹ 303 more rows
Preparing nodes data table
Instead of using the nodes data table extracted from mc3_data, we will prepare a new nodes data table by using the source and target fields of filtered_mc3_edges_owner data table. This is necessary to ensure that the nodes in nodes data tables include all the source and target values.
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Create a data frame with source nodes and rename column
id1 <- filtered_mc3_edges_owner %>%
select(source) %>%
rename(id = source) %>%
mutate(type_node = "company")
# Create a data frame with target nodes and rename column
id2 <- filtered_mc3_edges_owner %>%
select(target, type) %>%
rename(id = target, type_node = type)
# Combine the two data frames and remove duplicates
mc3_nodes1 <- rbind(id1, id2) %>%
distinct()
#see if need add in some of the nodes detail
mc3_nodes1
# A tibble: 362 × 2
id type_node
<chr> <chr>
1 Acevedo, Dickson and Gonzalez company
2 Adams Group company
3 Adams-Pope company
4 Adriatic Catch S.A. de C.V. company
5 Albertine Rift NV Family company
6 Alexander PLC company
7 Alvarez Ltd company
8 Alvarez, Young and Ramos company
9 Ancla del Este Ltd. Liability Co company
10 Ancla del Este Sp Fish company
# ℹ 352 more rows
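The requirement that the node table cover every edge endpoint can be verified explicitly. The sketch below rebuilds a node table from a made-up edge list (all names are hypothetical) and checks that no source or target is missing from it.

```r
library(dplyr)

# Made-up edge list (illustration only)
toy_edges <- tibble(
  source = c("Co A", "Co B", "Co C"),
  target = c("Ann", "Ann", "Bob"),
  type   = "Beneficial Owner"
)

# Stack the sources (companies) and targets (people), then drop repeats
toy_nodes <- bind_rows(
  toy_edges %>% transmute(id = source, type_node = "company"),
  toy_edges %>% transmute(id = target, type_node = type)
) %>%
  distinct()

toy_nodes

# Every edge endpoint must appear in the node table
all(c(toy_edges$source, toy_edges$target) %in% toy_nodes$id)
```

Running the same check on mc3_nodes1 and filtered_mc3_edges_owner before building the graph is a cheap safeguard against unmatched endpoints.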
Tidygraph model
mc3_graph <- tbl_graph(nodes = mc3_nodes1,
edges = filtered_mc3_edges_owner,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness())
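Betweenness centrality, which is stored on the nodes here, counts how many shortest paths between other nodes pass through a given node. The sketch below builds a made-up star-shaped graph (one owner linked to three companies; all names are hypothetical) to show how an owner who connects several companies stands out.

```r
library(dplyr)
library(tidygraph)

# Made-up nodes and edges (illustration only)
toy_nodes <- tibble(id = c("Co A", "Co B", "Co C", "Ann"))
toy_edges <- tibble(source = c("Co A", "Co B", "Co C"),
                    target = "Ann")

toy_graph <- tbl_graph(nodes = toy_nodes,
                       edges = toy_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness())

# Pull the node table back out to inspect the scores
toy_result <- toy_graph %>%
  activate(nodes) %>%
  as_tibble()
toy_result
```

Every path between two companies runs through Ann, so she carries all of the betweenness while the companies score zero. In the plots that follow, larger points therefore indicate nodes that bridge many others.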
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Preparing edges tibble data frame
edges_df <- mc3_graph %>%
  activate(edges) %>%
  as_tibble()
# Preparing nodes tibble data frame
nodes_df <- mc3_graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = label)
nodes_df <- nodes_df %>%
rename(group = type_node)
# Plot the network graph with labeled nodes using visNetwork
visNetwork(nodes_df, edges_df, main = list(text = "Network Graph of Company and Beneficial Owner",
style = "color: black; font-weight: bold; text-align: center;")) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visLayout(randomSeed = 123) %>%
addFontAwesome(name ="font-awesome") %>%
visGroups(groupname = "company", shape = "icon",
icon = list(code = "f0f7", color = "#000000")) %>%
visGroups(groupname = "Beneficial Owner", shape = "icon",
icon = list(code = "f2bd")) %>%
visLegend() %>%
visOptions(
highlightNearest = TRUE,
nodesIdSelection = TRUE,
) %>%
visInteraction(
zoomView = TRUE,
dragNodes = TRUE,
dragView = TRUE,
navigationButtons = TRUE,
selectable = TRUE, # Enable node selection
hover = TRUE, # Enable hover effects
)
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Set a seed for reproducibility
set.seed(123)
ggraph_own <- mc3_graph %>%
  ggraph(layout = "fr") +
  # constants such as alpha and colour are set outside aes()
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph()
ggraph_own
5.2 Building network model with tidygraph for Company Contacts
Similarly, to plot the network graph of Company and Company Contacts, we follow the same steps as above.
#| echo: false
#| fig-width: 4
#| fig-height: 4
#Filter the type = "Company Contacts" to create the edge data table
mc3_edges_cc<- mc3_edges_with_no_of_companies %>%
filter(no_of_companies > 3, type == "Company Contacts")
# Create the nodes data table
# Create a data frame with source nodes and rename column
id3 <- mc3_edges_cc %>%
select(source) %>%
rename(id = source) %>%
mutate(type_node = "company")
# Create a data frame with target nodes and rename column
id4 <- mc3_edges_cc %>%
select(target, type) %>%
rename(id = target, type_node = type)
# Combine the two data frames and remove duplicates
mc3_nodes2 <- rbind(id3, id4) %>%
distinct()
#Building the tidygraph model for company contacts
mc3_graph2 <- tbl_graph(nodes = mc3_nodes2,
edges = mc3_edges_cc,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness(),
closeness_centrality = centrality_closeness())
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Preparing edges tibble data frame
edges_df_2 <- mc3_graph2 %>%
  activate(edges) %>%
  as_tibble()
# Preparing nodes tibble data frame
nodes_df_2 <- mc3_graph2 %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  relocate(id, .before = label)
nodes_df_2 <- nodes_df_2 %>%
rename(group = type_node)
# Plot the network graph with labeled nodes using visNetwork
visNetwork(nodes_df_2, edges_df_2, main = list(text = "Network Graph of Company and Company Contacts",
style = "color: black; font-weight: bold; text-align: center;")) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visLayout(randomSeed = 123) %>%
addFontAwesome(name ="font-awesome") %>%
visGroups(groupname = "company", shape = "icon",
icon = list(code = "f0f7", color = "#000000")) %>%
visGroups(groupname = "Company Contacts", shape = "icon",
icon = list(code = "f0c0")) %>%
visOptions(
highlightNearest = TRUE,
nodesIdSelection = TRUE,
) %>%
visLegend() %>%
visInteraction(
zoomView = TRUE,
dragNodes = TRUE,
dragView = TRUE,
navigationButtons = TRUE,
selectable = TRUE, # Enable node selection
hover = TRUE, # Enable hover effects
)
#| echo: false
#| fig-width: 4
#| fig-height: 4
# Set a seed for reproducibility
set.seed(123)
mc3_graph2 %>%
  ggraph(layout = "fr") +
  # constants such as alpha and colour are set outside aes()
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph()
5.3 Adding the mc3_nodes_unique attributes
Here we will consider both beneficial owners and company contacts.
filtered_mc3_edges <- mc3_edges_with_no_of_companies %>%
filter(no_of_companies > 3)
#| echo: false
#| fig-width: 3
#| fig-height: 3
# Create a data frame with source nodes and rename column
id4 <- filtered_mc3_edges %>%
select(source) %>%
rename(id = source) %>%
mutate(type_node = "company")
# Create a data frame with target nodes and rename column
id5 <- filtered_mc3_edges %>%
select(target, type) %>%
rename(id = target, type_node = type)
# Combine the two data frames and remove duplicates
mc3_nodes3 <- rbind(id4, id5) %>%
distinct() %>%
left_join(mc3_nodes_unique,
unmatched = "drop") %>%
distinct()
mc3_nodes3 <- mc3_nodes3 %>%
mutate(revenue_omu = ifelse(revenue_omu == "" | is.na(revenue_omu), "0", revenue_omu))
# Convert the revenue column to numeric (if it's not already numeric)
mc3_nodes3$revenue_omu <- as.numeric(mc3_nodes3$revenue_omu)
# Calculate the revenue threshold for the top 10% (90th percentile), ignoring missing values
revenue_threshold <- quantile(mc3_nodes3$revenue_omu, probs = 0.90, na.rm = TRUE)
# Filter the DataFrame to retain only the rows with revenue above the threshold
filtered_mc3_nodes <- mc3_nodes3[mc3_nodes3$revenue_omu > revenue_threshold, ]
# View the filtered DataFrame
filtered_mc3_nodes
# A tibble: 54 × 7
id type_node country type revenue_omu product_services category
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Ancla del Este… company Uzifri… Comp… 130212. Operation of fi… Fishing…
2 Andhra Pradesh… company Rio Is… Comp… 787121. Grocery products Non-fis…
3 Bahía de Plata… company Novarc… Comp… 60335. Fabricated meta… Non-fis…
4 Bahía del Este… company Oceanus Comp… 254667. Swimwear and fa… Non-fis…
5 Bahía del Sol … company Novarc… Comp… 98065. Contract manufa… Non-fis…
6 Bahía del Sol … company Utopor… Comp… 67616. Gelatin Non-fis…
7 Baker and Sons company ZH Comp… 104095830. Fish; fresh or … Fishing…
8 BlueWaterBites… company Zawali… Comp… 199596. Canned Products… Non-fis…
9 Bu yu wang AG company Nalako… Comp… 62860. Gelatine produc… Non-fis…
10 Congo Rapids … company Riodel… Comp… 106161. Writing tools a… Non-fis…
# ℹ 44 more rows
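The link between the probs argument and the share of rows kept can be seen on a small example. The sketch below uses ten made-up revenue values, so the numbers are illustrative only.

```r
# Made-up revenues (illustration only)
toy_revenue <- c(0, 0, 10, 20, 30, 40, 50, 60, 70, 1000)

# 90th percentile: values strictly above it form the top 10%
threshold <- quantile(toy_revenue, probs = 0.90, na.rm = TRUE)
threshold

top_rows <- toy_revenue[toy_revenue > threshold]
top_rows

# For comparison, a top-20% cut would use probs = 0.80
quantile(toy_revenue, probs = 0.80, na.rm = TRUE)
```

Because missing revenues were recoded as 0 before this step, they sit at the bottom of the distribution and pull the percentile down. The threshold is therefore relative to all nodes in the table, not only to those with reported revenue.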
#| echo: false
#| fig-width: 5
#| fig-height: 6
# Create a bar chart of revenue vs ID using ggplot
bar_plot_toprev <- ggplot(filtered_mc3_nodes, aes(x = reorder(id, revenue_omu), y = revenue_omu/1000)) +
geom_bar_interactive(aes(tooltip = paste("ID:", id,
"<br>Type:", type_node,
"<br>Country:", country,
"<br>Revenue:", revenue_omu,
"<br>Product Services:", product_services)),
stat = "identity", fill = "steelblue") +
labs(x = "id", y = "Revenue_omu ('000)", title = "Top 10% ids") +
coord_flip() +
theme(plot.title = element_text(face = "bold"))+
theme(axis.text.y = element_text(size = 6))
# Print the bar plot
girafe(ggobj = bar_plot_toprev,
width_svg = 8,
height_svg = 8*0.618)